<!DOCTYPE html>
<html lang="en">
<!-- Produced from a LaTeX source file.  Note that the production is done -->
<!-- by a very rough-and-ready (and buggy) script, so the HTML and other  -->
<!-- code is quite ugly!  Later versions should be better.                -->
<head>
    <meta charset="utf-8">
    <meta name="citation_title" content="Neural Networks and Deep Learning">
    <meta name="citation_author" content="Nielsen, Michael A.">
    <meta name="citation_publication_date" content="2015">
    <meta name="citation_fulltext_html_url" content="http://neuralnetworksanddeeplearning.com">
    <meta name="citation_publisher" content="Determination Press">
    <meta name="citation_fulltext_world_readable" content="">
    <link rel="icon" href="http://neuralnetworksanddeeplearning.com/nnadl_favicon.ICO" />
    <title>Neural networks and deep learning</title>
    <script src="assets/jquery.min.js"></script>
    <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: {inlineMath: [['$','$']]},
        "HTML-CSS": 
          {scale: 92},
        TeX: { equationNumbers: { autoNumber: "AMS" }}});
    </script>
    <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>


    <link href="assets/style.css" rel="stylesheet">
    <link href="assets/pygments.css" rel="stylesheet">
    <link rel="stylesheet" href="https://code.jquery.com/ui/1.11.2/themes/smoothness/jquery-ui.css">

<style>
/* Adapted from */
/* https://groups.google.com/d/msg/mathjax-users/jqQxrmeG48o/oAaivLgLN90J, */
/* by David Cervone */

@font-face {
    font-family: 'MJX_Math';
    src: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); /* IE9 Compat Modes */
    src: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot?iefix') format('eot'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff')  format('woff'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf')  format('opentype'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/svg/MathJax_Math-Italic.svg#MathJax_Math-Italic') format('svg');
}

@font-face {
    font-family: 'MJX_Main';
    src: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); /* IE9 Compat Modes */
    src: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot?iefix') format('eot'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff')  format('woff'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf')  format('opentype'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/svg/MathJax_Main-Regular.svg#MathJax_Main-Regular') format('svg');
}
</style>

  </head>
  <body><div class="header"><h1 class="chapter_number">
  <a href="chap5.html">CHAPTER 5</a></h1>
  <h1 class="chapter_title"><a href="chap5.html">Why are deep neural networks hard to train?</a></h1></div><div class="section"><div id="toc"> 
<p class="toc_title"><a href="index.html">Neural Networks and Deep Learning</a></p><p class="toc_not_mainchapter"><a href="about.html">What this book is about</a></p><p class="toc_not_mainchapter"><a href="exercises_and_problems.html">On the exercises and problems</a></p><p class='toc_mainchapter'><a id="toc_using_neural_nets_to_recognize_handwritten_digits_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_using_neural_nets_to_recognize_handwritten_digits" src="images/arrow.png" width="15px"></a><a href="chap1.html">Using neural nets to recognize handwritten digits</a><div id="toc_using_neural_nets_to_recognize_handwritten_digits" style="display: none;"><p class="toc_section"><ul><a href="chap1.html#perceptrons"><li>Perceptrons</li></a><a href="chap1.html#sigmoid_neurons"><li>Sigmoid neurons</li></a><a href="chap1.html#the_architecture_of_neural_networks"><li>The architecture of neural networks</li></a><a href="chap1.html#a_simple_network_to_classify_handwritten_digits"><li>A simple network to classify handwritten digits</li></a><a href="chap1.html#learning_with_gradient_descent"><li>Learning with gradient descent</li></a><a href="chap1.html#implementing_our_network_to_classify_digits"><li>Implementing our network to classify digits</li></a><a href="chap1.html#toward_deep_learning"><li>Toward deep learning</li></a></ul></p></div>
<script>
$('#toc_using_neural_nets_to_recognize_handwritten_digits_reveal').click(function() { 
   var src = $('#toc_img_using_neural_nets_to_recognize_handwritten_digits').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow.png');
   };
   $('#toc_using_neural_nets_to_recognize_handwritten_digits').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_how_the_backpropagation_algorithm_works_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_how_the_backpropagation_algorithm_works" src="images/arrow.png" width="15px"></a><a href="chap2.html">How the backpropagation algorithm works</a><div id="toc_how_the_backpropagation_algorithm_works" style="display: none;"><p class="toc_section"><ul><a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network"><li>Warm up: a fast matrix-based approach to computing the output  from a neural network</li></a><a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function"><li>The two assumptions we need about the cost function</li></a><a href="chap2.html#the_hadamard_product_$s_\odot_t$"><li>The Hadamard product, $s \odot t$</li></a><a href="chap2.html#the_four_fundamental_equations_behind_backpropagation"><li>The four fundamental equations behind backpropagation</li></a><a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)"><li>Proof of the four fundamental equations (optional)</li></a><a href="chap2.html#the_backpropagation_algorithm"><li>The backpropagation algorithm</li></a><a href="chap2.html#the_code_for_backpropagation"><li>The code for backpropagation</li></a><a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm"><li>In what sense is backpropagation a fast algorithm?</li></a><a href="chap2.html#backpropagation_the_big_picture"><li>Backpropagation: the big picture</li></a></ul></p></div>
<script>
$('#toc_how_the_backpropagation_algorithm_works_reveal').click(function() { 
   var src = $('#toc_img_how_the_backpropagation_algorithm_works').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow.png');
   };
   $('#toc_how_the_backpropagation_algorithm_works').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_improving_the_way_neural_networks_learn_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_improving_the_way_neural_networks_learn" src="images/arrow.png" width="15px"></a><a href="chap3.html">Improving the way neural networks learn</a><div id="toc_improving_the_way_neural_networks_learn" style="display: none;"><p class="toc_section"><ul><a href="chap3.html#the_cross-entropy_cost_function"><li>The cross-entropy cost function</li></a><a href="chap3.html#overfitting_and_regularization"><li>Overfitting and regularization</li></a><a href="chap3.html#weight_initialization"><li>Weight initialization</li></a><a href="chap3.html#handwriting_recognition_revisited_the_code"><li>Handwriting recognition revisited: the code</li></a><a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters"><li>How to choose a neural network's hyper-parameters?</li></a><a href="chap3.html#other_techniques"><li>Other techniques</li></a></ul></p></div>
<script>
$('#toc_improving_the_way_neural_networks_learn_reveal').click(function() { 
   var src = $('#toc_img_improving_the_way_neural_networks_learn').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow.png');
   };
   $('#toc_improving_the_way_neural_networks_learn').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_a_visual_proof_that_neural_nets_can_compute_any_function" src="images/arrow.png" width="15px"></a><a href="chap4.html">A visual proof that neural nets can compute any function</a><div id="toc_a_visual_proof_that_neural_nets_can_compute_any_function" style="display: none;"><p class="toc_section"><ul><a href="chap4.html#two_caveats"><li>Two caveats</li></a><a href="chap4.html#universality_with_one_input_and_one_output"><li>Universality with one input and one output</li></a><a href="chap4.html#many_input_variables"><li>Many input variables</li></a><a href="chap4.html#extension_beyond_sigmoid_neurons"><li>Extension beyond sigmoid neurons</li></a><a href="chap4.html#fixing_up_the_step_functions"><li>Fixing up the step functions</li></a><a href="chap4.html#conclusion"><li>Conclusion</li></a></ul></p></div>
<script>
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal').click(function() { 
   var src = $('#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow.png');
   };
   $('#toc_a_visual_proof_that_neural_nets_can_compute_any_function').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_why_are_deep_neural_networks_hard_to_train_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_why_are_deep_neural_networks_hard_to_train" src="images/arrow.png" width="15px"></a><a href="chap5.html">Why are deep neural networks hard to train?</a><div id="toc_why_are_deep_neural_networks_hard_to_train" style="display: none;"><p class="toc_section"><ul><a href="chap5.html#the_vanishing_gradient_problem"><li>The vanishing gradient problem</li></a><a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets"><li>What's causing the vanishing gradient problem?  Unstable gradients in deep neural nets</li></a><a href="chap5.html#unstable_gradients_in_more_complex_networks"><li>Unstable gradients in more complex networks</li></a><a href="chap5.html#other_obstacles_to_deep_learning"><li>Other obstacles to deep learning</li></a></ul></p></div>
<script>
$('#toc_why_are_deep_neural_networks_hard_to_train_reveal').click(function() { 
   var src = $('#toc_img_why_are_deep_neural_networks_hard_to_train').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow.png');
   };
   $('#toc_why_are_deep_neural_networks_hard_to_train').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_deep_learning_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_deep_learning" src="images/arrow.png" width="15px"></a><a href="chap6.html">Deep learning</a><div id="toc_deep_learning" style="display: none;"><p class="toc_section"><ul><a href="chap6.html#introducing_convolutional_networks"><li>Introducing convolutional networks</li></a><a href="chap6.html#convolutional_neural_networks_in_practice"><li>Convolutional neural networks in practice</li></a><a href="chap6.html#the_code_for_our_convolutional_networks"><li>The code for our convolutional networks</li></a><a href="chap6.html#recent_progress_in_image_recognition"><li>Recent progress in image recognition</li></a><a href="chap6.html#other_approaches_to_deep_neural_nets"><li>Other approaches to deep neural nets</li></a><a href="chap6.html#on_the_future_of_neural_networks"><li>On the future of neural networks</li></a></ul></p></div>
<script>
$('#toc_deep_learning_reveal').click(function() { 
   var src = $('#toc_img_deep_learning').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_deep_learning").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_deep_learning").attr('src', 'images/arrow.png');
   };
   $('#toc_deep_learning').toggle('fast', function() {});  
});</script><p class="toc_not_mainchapter"><a href="sai.html">Appendix: Is there a <em>simple</em> algorithm for intelligence?</a></p><p class="toc_not_mainchapter"><a href="acknowledgements.html">Acknowledgements</a></p><p class="toc_not_mainchapter"><a href="faq.html">Frequently Asked Questions</a></p>
<hr>
<p class="sidebar"> If you benefit from the book, please make a small
donation.  I suggest $5, but you can choose the amount.</p>

<form action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">
<input type="hidden" name="cmd" value="_s-xclick">
<input type="hidden" name="hosted_button_id" value="5K9YAHR4X84RN">
<input type="image" src="https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif" border="0" name="submit" alt="PayPal - The safer, easier way to pay online!">
<img alt="" border="0" src="https://www.paypalobjects.com/en_US/i/scr/pixel.gif" width="1" height="1">
</form>

<p class="sidebar">Alternately, you can make a donation by sending me
Bitcoin, at address <span style="font-size: 0.7em">1Kd6tXH5SDAmiFb49J9hknG5pqj7KStSAx</span></p>

<!--
<hr>

<p class="sidebar"> If you benefit from the book, please make a small
donation.  I suggest $3, but you can choose the amount.</p>

<form action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">
<input type="hidden" name="cmd" value="_s-xclick">
<input type="hidden" name="encrypted" value="-----BEGIN PKCS7-----MIIHTwYJKoZIhvcNAQcEoIIHQDCCBzwCAQExggEwMIIBLAIBADCBlDCBjjELMAkGA1UEBhMCVVMxCzAJBgNVBAgTAkNBMRYwFAYDVQQHEw1Nb3VudGFpbiBWaWV3MRQwEgYDVQQKEwtQYXlQYWwgSW5jLjETMBEGA1UECxQKbGl2ZV9jZXJ0czERMA8GA1UEAxQIbGl2ZV9hcGkxHDAaBgkqhkiG9w0BCQEWDXJlQHBheXBhbC5jb20CAQAwDQYJKoZIhvcNAQEBBQAEgYAtusFIFTgWVpgZsMgI9zMrWRAFFKQqeFiE6ay1nbmP360YzPtR+vvCXwn214Az9+F9g7mFxe0L+m9zOCdjzgRROZdTu1oIuS78i0TTbcbD/Vs/U/f9xcmwsdX9KYlhimfsya0ydPQ2xvr4iSGbwfNemIPVRCTadp/Y4OQWWRFKGTELMAkGBSsOAwIaBQAwgcwGCSqGSIb3DQEHATAUBggqhkiG9w0DBwQIK5obVTaqzmyAgajgc4w5t7l6DjTGVI7k+4UyO3uafxPac23jOyBGmxSnVRPONB9I+/Q6OqpXZtn8JpTuzFmuIgkNUf1nldv/DA1mhPOeeVxeuSGL8KpWxpJboKZ0mEu9b+0FJXvZW+snv0jodnRDtI4g0AXDZNPyRWIdJ3m+tlYfsXu4mQAe0q+CyT+QrSRhPGI/llicF4x3rMbRBNqlDze/tFqp/jbgW84Puzz6KyxAez6gggOHMIIDgzCCAuygAwIBAgIBADANBgkqhkiG9w0BAQUFADCBjjELMAkGA1UEBhMCVVMxCzAJBgNVBAgTAkNBMRYwFAYDVQQHEw1Nb3VudGFpbiBWaWV3MRQwEgYDVQQKEwtQYXlQYWwgSW5jLjETMBEGA1UECxQKbGl2ZV9jZXJ0czERMA8GA1UEAxQIbGl2ZV9hcGkxHDAaBgkqhkiG9w0BCQEWDXJlQHBheXBhbC5jb20wHhcNMDQwMjEzMTAxMzE1WhcNMzUwMjEzMTAxMzE1WjCBjjELMAkGA1UEBhMCVVMxCzAJBgNVBAgTAkNBMRYwFAYDVQQHEw1Nb3VudGFpbiBWaWV3MRQwEgYDVQQKEwtQYXlQYWwgSW5jLjETMBEGA1UECxQKbGl2ZV9jZXJ0czERMA8GA1UEAxQIbGl2ZV9hcGkxHDAaBgkqhkiG9w0BCQEWDXJlQHBheXBhbC5jb20wgZ8wDQYJKoZIhvcNAQEBBQADgY0AMIGJAoGBAMFHTt38RMxLXJyO2SmS+Ndl72T7oKJ4u4uw+6awntALWh03PewmIJuzbALScsTS4sZoS1fKciBGoh11gIfHzylvkdNe/hJl66/RGqrj5rFb08sAABNTzDTiqqNpJeBsYs/c2aiGozptX2RlnBktH+SUNpAajW724Nv2Wvhif6sFAgMBAAGjge4wgeswHQYDVR0OBBYEFJaffLvGbxe9WT9S1wob7BDWZJRrMIG7BgNVHSMEgbMwgbCAFJaffLvGbxe9WT9S1wob7BDWZJRroYGUpIGRMIGOMQswCQYDVQQGEwJVUzELMAkGA1UECBMCQ0ExFjAUBgNVBAcTDU1vdW50YWluIFZpZXcxFDASBgNVBAoTC1BheVBhbCBJbmMuMRMwEQYDVQQLFApsaXZlX2NlcnRzMREwDwYDVQQDFAhsaXZlX2FwaTEcMBoGCSqGSIb3DQEJARYNcmVAcGF5cGFsLmNvbYIBADAMBgNVHRMEBTADAQH/MA0GCSqGSIb3DQEBBQUAA4GBAIFfOlaagFrl71+jq6OKidbWFSE+Q4FqROvdgIONth+8kSK//Y/4ihuE4Ymvzn5ceE3S/iBSQQMjyvb+s2TWbQYDwcp129OPIbD9epdr4tJOUNiSojw7BHwYRiPh58S1xGlFgHFXwrEBb3dgNbMUa+u4qectsMAXpVHnD9wIyfmHMYIBmjCCAZYCAQEwgZQwgY4xCzAJBgNVBAYTAlVTMQswCQYDVQQIEwJDQTEWMBQGA1UEBxMNTW91bnRhaW4gVmlldzEUMBIGA1UEChMLUGF5UGFsIEluYy4xEzARBgNVBAsUCmxpdmVfY2VydHMxETAPBgNVBAMUCGxpdmVfYXBpMRwwGgYJKoZIhvcNAQkBFg1yZUBwYXlwYWwuY29tAgEAMAkGBSsOAwIaBQCgXTAYBgkqhkiG9w0BCQMxCwYJKoZIhvcNAQcBMBwGCSqGSIb3DQEJBTEPFw0xNTA4MDUxMzMyMTRaMCMGCSqGSIb3DQEJBDEWBBRtGLYvbZ45sWVegWVP2CuXTHPmJTANBgkqhkiG9w0BAQEFAASBgKgrMHMINfV7yVuZgcTjp8gUzejPF2x2zRPU/G8pKUvYIl1F38TjV2pe4w0QXcGMJRT8mQfxHCy9UmF3LfblH8F0NSMMDrZqu3M0eLk96old+L0Xl6ING8l3idFDkLagE+lZK4A0rNV35aMci3VLvjQ34CvEj7jaHeLpbkgk/l6v-----END PKCS7-----
">
<input type="image" src="https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif" border="0" name="submit" alt="PayPal - The safer, easier way to pay online!">
<img alt="" border="0" src="https://www.paypalobjects.com/en_US/i/scr/pixel.gif" width="1" height="1">
</form>

-->

<hr>
<span class="sidebar_title">Sponsors</span>
<br/>

<a href="https://lambdalabs.com/?utm_source=neuralnetworksdeeplearning&utm_medium=banner&utm_campaign=blogin&utm_content=rbannerimg">
  <img src="assets/lambda.png" width="200px" style="padding: 3px 0px 0px 10px; border-style: none;">
</a>
<br>
<div style="line-height: 1.2; padding-bottom: 12px; font-size: 0.8;">
  <a href="https://lambdalabs.com/?utm_source=neuralnetworksdeeplearning&utm_medium=banner&utm_campaign=blogin&utm_content=rtext">Deep Learning Workstations, Servers, and Laptops</a>
</div>

<a href='http://gsquaredcapital.com/'><img src='assets/gsquared.png' width='200px' style="padding: 5px 0px 10px 10px; border-style: none;"></a>

<a href='http://www.tineye.com'><img src='assets/tineye.png' width='200px'
style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://www.visionsmarts.com'><img
src='assets/visionsmarts.png' width='210px' style="padding: 0px 0px
0px 0px; border-style: none;"></a> <br/> 

<p class="sidebar">Thanks to all the <a
href="supporters.html">supporters</a> who made the book possible, with
especial thanks to Pavel Dudrenov.  Thanks also to all the
contributors to the <a href="bugfinder.html">Bugfinder Hall of
Fame</a>.  </p>

<hr>
<span class="sidebar_title">Resources</span>

<p class="sidebar"><a href="https://twitter.com/michael_nielsen">Michael Nielsen on Twitter</a></p>

<p class="sidebar"><a href="faq.html">Book FAQ</a></p>

<p class="sidebar">
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning">Code repository</a></p>

<p class="sidebar">
<a href="https://michaelnielsenupdates.substack.com/subscribe">Michael Nielsen's project announcement mailing list</a>
</p>

<p class="sidebar"> <a href="http://www.deeplearningbook.org/">Deep Learning</a>, book by Ian
Goodfellow, Yoshua Bengio, and Aaron Courville</p>

<p class="sidebar"><a href="http://cognitivemedium.com">cognitivemedium.com</a></p>

<hr>
<a href="http://michaelnielsen.org"><img src="assets/Michael_Nielsen_Web_Small.jpg" width="160px" style="border-style: none;"/></a>

<p class="sidebar">
By <a href="http://michaelnielsen.org">Michael Nielsen</a> / Dec 2019
</p>
</div>
</p><p>Imagine you're an engineer who has been asked to design a computerfrom scratch.  One day you're working away in your office, designinglogical circuits, setting out <CODE>AND</CODE> gates, <CODE>OR</CODE> gates,and so on, when your boss walks in with bad news. The customer hasjust added a surprising design requirement: the circuit for the entirecomputer must be just two layers deep:</p><p><center><img src="images/shallow_circuit.png" width="500px"></center></p><p>You're dumbfounded, and tell your boss: "The customer is crazy!"</p><p>Your boss replies: "I think they're crazy, too.  But what thecustomer wants, they get."</p><p>In fact, there's a limited sense in which the customer isn't crazy.Suppose you're allowed to use a special logical gate which lets you<CODE>AND</CODE> together as many inputs as you want.  And you're alsoallowed a many-input <CODE>NAND</CODE> gate, that is, a gate which can<CODE>AND</CODE> multiple inputs and then negate the output.  With thesespecial gates it turns out to be possible to compute any function atall using a circuit that's just two layers deep.</p><p>But just because something is possible doesn't make it a good idea.In practice, when solving circuit design problems (or most any kind ofalgorithmic problem), we usually start by figuring out how to solvesub-problems, and then gradually integrate the solutions.  In otherwords, we build up to a solution through multiple layers ofabstraction.  </p><p>For instance, suppose we're designing a logical circuit to multiplytwo numbers.  Chances are we want to build it up out of sub-circuitsdoing operations like adding two numbers.  The sub-circuits for addingtwo numbers will, in turn, be built up out of sub-sub-circuits foradding two bits.  Very roughly speaking our circuit will look like:</p><p><center><img src="images/circuit_multiplication.png" width="500px"></center></p><p>That is, our final circuit contains at least three layers of circuitelements.  In fact, it'll probably contain more than three layers, aswe break the sub-tasks down into smaller units than I've described.But you get the general idea.</p><p>So deep circuits make the process of design easier.  But they're notjust helpful for design.  There are, in fact, mathematical proofsshowing that for some functions very shallow circuits requireexponentially more circuit elements to compute than do deep circuits.For instance, a famous series of papers in the early1980s*<span class="marginnote">*The history is somewhat complex, so I won't give  detailed references.  See Johan Håstad's 2012 paper  <a href="http://eccc.hpi-web.de/report/2012/137/">On the correlation of    parity and small-depth circuits</a> for an account of the early  history and references.</span> showed that computing the parity of a setof bits requires exponentially many gates, if done with a shallowcircuit.  On the other hand, if you use deeper circuits it's easy tocompute the parity using a small circuit: you just compute the parityof pairs of bits, then use those results to compute the parity ofpairs of pairs of bits, and so on, building up quickly to the overallparity.  Deep circuits thus can be intrinsically much more powerfulthan shallow circuits.</p><p>Up to now, this book has approached neural networks like the crazycustomer.  Almost all the networks we've worked with have just asingle hidden layer of neurons (plus the input and output layers):</p><p><center><img src="images/tikz35.png"/></center></p><p>These simple networks have been remarkably useful: in earlier chapterswe used networks like this to classify handwritten digits with betterthan 98 percent accuracy!  Nonetheless, intuitively we'd expectnetworks with many more hidden layers to be more powerful:</p><p><center><img src="images/tikz36.png"/></center></p><p>Such networks could use the intermediate layers to build up multiplelayers of abstraction, just as we do in Boolean circuits.  Forinstance, if we're doing visual pattern recognition, then the neuronsin the first layer might learn to recognize edges, the neurons in thesecond layer could learn to recognize more complex shapes, saytriangle or rectangles, built up from edges.  The third layer wouldthen recognize still more complex shapes.  And so on.  These multiplelayers of abstraction seem likely to give deep networks a compellingadvantage in learning to solve complex pattern recognition problems.Moreover, just as in the case of circuits, there are theoreticalresults suggesting that deep networks are intrinsically more powerfulthan shallow networks*<span class="marginnote">*For certain problems and network  architectures this is proved in  <a href="http://arxiv.org/pdf/1312.6098.pdf">On the number of response    regions of deep feed forward networks with piece-wise linear    activations</a>, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio  (2014). See also the more informal discussion in section 2 of  <a href="http://www.iro.umontreal.ca/&#126;bengioy/papers/ftml_book.pdf">Learning    deep architectures for AI</a>, by Yoshua Bengio (2009).</span>.</p><p>How can we train such deep networks?  In this chapter, we'll trytraining deep networks using our workhorse learning algorithm -<a href="chap1.html#learning_with_gradient_descent">stochastic  gradient descent</a> by <a href="chap2.html">backpropagation</a>.  But we'llrun into trouble, with our deep networks not performing much (if atall) better than shallow networks.</p><p>That failure seems surprising in the light of the discussion above.Rather than give up on deep networks, we'll dig down and try tounderstand what's making our deep networks hard to train.  When welook closely, we'll discover that the different layers in our deepnetwork are learning at vastly different speeds.  In particular, whenlater layers in the network are learning well, early layers often getstuck during training, learning almost nothing at all.  This stucknessisn't simply due to bad luck.  Rather, we'll discover there arefundamental reasons the learning slowdown occurs, connected to our useof gradient-based learning techniques.</p><p>As we delve into the problem more deeply, we'll learn that theopposite phenomenon can also occur: the early layers may be learningwell, but later layers can become stuck.  In fact, we'll find thatthere's an intrinsic instability associated to learning by gradientdescent in deep, many-layer neural networks.  This instability tendsto result in either the early or the later layers getting stuck duringtraining.</p><p>This all sounds like bad news.  But by delving into thesedifficulties, we can begin to gain insight into what's required totrain deep networks effectively.  And so these investigations are goodpreparation for the next chapter, where we'll use deep learning toattack image recognition problems.</p><p><h3><a name="the_vanishing_gradient_problem"></a><a href="chap5.html#the_vanishing_gradient_problem">The vanishing gradient problem</a></h3></p><p>So, what goes wrong when we try to train a deep network?</p><p>To answer that question, let's first revisit the case of a networkwith just a single hidden layer.  As per usual, we'll use the MNISTdigit classification problem as our playground for learning andexperimentation*<span class="marginnote">*I introduced the MNIST problem and data  <a href="chap1.html#learning_with_gradient_descent">here</a> and  <a href="chap1.html#implementing_our_network_to_classify_digits">here</a>.</span>.</p><p>If you wish, you can follow along by training networks on yourcomputer.  It is also, of course, fine to just read along.  If you dowish to follow live, then you'll need Python 2.7, Numpy, and a copy ofthe code, which you can get by cloning the relevant repository fromthe command line:<div class="highlight"><pre><span></span>git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
</pre></div>
If you don't use <tt>git</tt> then you can download the data and code<a href="https://github.com/mnielsen/neural-networks-and-deep-learning/archive/master.zip">here</a>.  You'll need to change into the <tt>src</tt> subdirectory.</p><p>Then, from a Python shell we load the MNIST data:</p><p><div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">mnist_loader</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> \
<span class="o">...</span> <span class="n">mnist_loader</span><span class="o">.</span><span class="n">load_data_wrapper</span><span class="p">()</span>
</pre></div>
</p><p>We set up our network:</p><p><div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">network2</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
</pre></div>
</p><p>This network has 784 neurons in the input layer, corresponding to the$28 \times 28 = 784$ pixels in the input image.  We use 30 hiddenneurons, as well as 10 output neurons, corresponding to the 10possible classifications for the MNIST digits ('0', '1', '2', $\ldots$,'9').</p><p>Let's try training our network for 30 complete epochs, usingmini-batches of 10 training examples at a time, a learning rate $\eta= 0.1$, and regularization parameter $\lambda = 5.0$.  As we trainwe'll monitor the classification accuracy on the<tt>validation_data</tt>*<span class="marginnote">*Note that the networks is likely to  take some minutes to train, depending on the speed of your machine.  So if you're running the code you may wish to continue reading and  return later, not wait for the code to finish executing.</span>:</p><p><div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span> 
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p><p>We get a classification accuracy of 96.48 percent (or thereabouts -it'll vary a bit from run to run), comparable to our earlier resultswith a similar configuration.</p><p>Now, let's add another hidden layer, also with 30 neurons in it, andtry training with the same hyper-parameters:</p><p><div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span> 
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p><p>This gives an improved classification accuracy, 96.90 percent.  That'sencouraging: a little more depth is helping.  Let's add another30-neuron hidden layer:</p><p><div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span> 
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p><p>That doesn't help at all.  In fact, the result drops back down to96.57 percent, close to our original shallow network.  And suppose weinsert one further hidden layer:</p><p><div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span> 
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p><p>The classification accuracy drops again, to 96.53 percent.  That'sprobably not a statistically significant drop, but it's notencouraging, either.</p><p>This behaviour seems strange.  Intuitively, extra hidden layers oughtto make the network able to learn more complex classificationfunctions, and thus do a better job classifying.  Certainly, thingsshouldn't get worse, since the extra layers can, in the worst case,simply do nothing*<span class="marginnote">*See <a href="chap5.html#identity_neuron">this later    problem</a> to understand how to build a hidden layer that does  nothing.</span>.  But that's not what's going on.</p><p>So what is going on?  Let's assume that the extra hidden layers reallycould help in principle, and the problem is that our learningalgorithm isn't finding the right weights and biases.  We'd like tofigure out what's going wrong in our learning algorithm, and how to dobetter.</p><p>To get some insight into what's going wrong, let's visualize how thenetwork learns.  Below, I've plotted part of a $[784, 30, 30, 10]$network, i.e., a network with two hidden layers, each containing $30$hidden neurons.  Each neuron in the diagram has a little bar on it,representing how quickly that neuron is changing as the networklearns.  A big bar means the neuron's weights and bias are changingrapidly, while a small bar means the weights and bias are changingslowly.  More precisely, the bars denote the gradient $\partial C/ \partial b$ for each neuron, i.e., the rate of change of the costwith respect to the neuron's bias.  Back in<a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">Chapter  2</a> we saw that this gradient quantity controlled not just howrapidly the bias changes during learning, but also how rapidly theweights input to the neuron change, too.  Don't worry if you don'trecall the details: the thing to keep in mind is simply that thesebars show how quickly each neuron's weights and bias are changing asthe network learns.</p><p>To keep the diagram simple, I've shown just the top six neurons in thetwo hidden layers.  I've omitted the input neurons, since they've gotno weights or biases to learn.  I've also omitted the output neurons,since we're doing layer-wise comparisons, and it makes most sense tocompare layers with the same number of neurons.  The results areplotted at the very beginning of training, i.e., immediately after thenetwork is initialized.  Here they are*<span class="marginnote">*The data plotted is  generated using the program  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/fig/generate_gradient.py">generate_gradient.py</a>.  The same program is also used to generate the results quoted later  in this section.</span>:</p><p><canvas id="initial_gradient" width=470 height=620></canvas></p><p>The network was initialized randomly, and so it's not surprising thatthere's a lot of variation in how rapidly the neurons learn.  Still,one thing that jumps out is that the bars in the second hidden layerare mostly much larger than the bars in the first hidden layer.  As aresult, the neurons in the second hidden layer will learn quite a bitfaster than the neurons in the first hidden layer.  Is this merely acoincidence, or are the neurons in the second hidden layer likely tolearn faster than neurons in the first hidden layer in general?</p><p>To determine whether this is the case, it helps to have a global wayof comparing the speed of learning in the first and second hiddenlayers.  To do this, let's denote the gradient as $\delta^l_j= \partial C /\partial b^l_j$, i.e., the gradient for the $j$th neuron in the $l$thlayer*<span class="marginnote">*Back in  <a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">Chapter    2</a> we referred to this as the error, but here we'll adopt the  informal term "gradient".  I say "informal" because of course  this doesn't explicitly include the partial derivatives of the cost  with respect to the weights, $\partial C / \partial w$.</span>.  We canthink of the gradient $\delta^1$ as a vector whose entries determinehow quickly the first hidden layer learns, and $\delta^2$ as a vectorwhose entries determine how quickly the second hidden layer learns.We'll then use the lengths of these vectors as (rough!)  globalmeasures of the speed at which the layers are learning.  So, forinstance, the length $\| \delta^1 \|$ measures the speed at which thefirst hidden layer is learning, while the length $\| \delta^2 \|$measures the speed at which the second hidden layer is learning.</p><p>With these definitions, and in the same configuration as was plottedabove, we find $\| \delta^1 \| = 0.07\ldots$ and $\| \delta^2 \| =0.31\ldots$.  So this confirms our earlier suspicion: the neurons inthe second hidden layer really are learning much faster than theneurons in the first hidden layer.</p><p>What happens if we add more hidden layers?  If we have three hiddenlayers, in a $[784, 30, 30, 30, 10]$ network, then the respectivespeeds of learning turn out to be 0.012, 0.060, and 0.283.  Again,earlier hidden layers are learning much slower than later hiddenlayers.  Suppose we add yet another layer with $30$ hidden neurons.In that case, the respective speeds of learning are 0.003, 0.017,0.070, and 0.285.  The pattern holds: early layers learn slower thanlater layers.</p><p>We've been looking at the speed of learning at the start of training,that is, just after the networks are initialized.  How does the speedof learning change as we train our networks?  Let's return to look atthe network with just two hidden layers.  The speed of learningchanges as follows:</p><p><center><img src="images/training_speed_2_layers.png" width="500px"></center></p><p>To generate these results, I used batch gradient descent with just1,000 training images, trained over 500 epochs.  This is a bitdifferent than the way we usually train - I've used no mini-batches,and just 1,000 training images, rather than the full 50,000 imagetraining set.  I'm not trying to do anything sneaky, or pull the woolover your eyes, but it turns out that using mini-batch stochasticgradient descent gives much noisier (albeit very similar, when youaverage away the noise) results.  Using the parameters I've chosen isan easy way of smoothing the results out, so we can see what's goingon.</p><p>In any case, as you can see the two layers start out learning at verydifferent speeds (as we already know).  The speed in both layers thendrops very quickly, before rebounding.  But through it all, the firsthidden layer learns much more slowly than the second hidden layer.</p><p>What about more complex networks?  Here's the results of a similarexperiment, but this time with three hidden layers (a $[784, 30, 30,30, 10]$ network):</p><p><center><img src="images/training_speed_3_layers.png" width="500px"></center></p><p>Again, early hidden layers learn much more slowly than later hiddenlayers.  Finally, let's add a fourth hidden layer (a $[784, 30, 30,30, 30, 10]$ network), and see what happens when we train:</p><p><center><img src="images/training_speed_4_layers.png" width="500px"></center></p><p>Again, early hidden layers learn much more slowly than later hiddenlayers.  In this case, the first hidden layer is learning roughly 100times slower than the final hidden layer.  No wonder we were havingtrouble training these networks earlier!</p><p>We have here an important observation: in at least some deep neuralnetworks, the gradient tends to get smaller as we move backwardthrough the hidden layers.  This means that neurons in the earlierlayers learn much more slowly than neurons in later layers.  And whilewe've seen this in just a single network, there are fundamentalreasons why this happens in many neural networks.  The phenomenon isknown as the <em>vanishing gradient problem</em>*<span class="marginnote">*See  <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.7321">Gradient    flow in recurrent nets: the difficulty of learning long-term    dependencies</a>, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi,  and J&uuml;rgen Schmidhuber (2001).  This paper studied recurrent  neural nets, but the essential phenomenon is the same as in the  feedforward networks we are studying.  See also Sepp Hochreiter's  earlier Diploma Thesis,  <a href="http://www.idsia.ch/&#126;juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf">Untersuchungen    zu dynamischen neuronalen Netzen</a> (1991, in German).</span>.</p><p>Why does the vanishing gradient problem occur?  Are there ways we canavoid it?  And how should we deal with it in training deep neuralnetworks?  In fact, we'll learn shortly that it's not inevitable,although the alternative is not very attractive, either: sometimes thegradient gets much larger in earlier layers!  This is the<em>exploding gradient problem</em>, and it's not much better news thanthe vanishing gradient problem.  More generally, it turns out that thegradient in deep neural networks is <em>unstable</em>, tending to eitherexplode or vanish in earlier layers.  This instability is afundamental problem for gradient-based learning in deep neuralnetworks.  It's something we need to understand, and, if possible,take steps to address.</p><p>One response to vanishing (or unstable) gradients is to wonder ifthey're really such a problem.  Momentarily stepping away from neuralnets, imagine we were trying to numerically minimize a function $f(x)$of a single variable.  Wouldn't it be good news if the derivative$f'(x)$ was small?  Wouldn't that mean we were already near anextremum?  In a similar way, might the small gradient in early layersof a deep network mean that we don't need to do much adjustment of theweights and biases?</p><p>Of course, this isn't the case.  Recall that we randomly initializedthe weight and biases in the network.  It is extremely unlikely ourinitial weights and biases will do a good job at whatever it is wewant our network to do.  To be concrete, consider the first layer ofweights in a $[784, 30, 30, 30, 10]$ network for the MNIST problem.The random initialization means the first layer throws away mostinformation about the input image.  Even if later layers have beenextensively trained, they will still find it extremely difficult toidentify the input image, simply because they don't have enoughinformation.  And so it can't possibly be the case that not muchlearning needs to be done in the first layer.  If we're going to traindeep networks, we need to figure out how to address the vanishinggradient problem.</p><p><h3><a name="what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets"></a><a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets">What's causing the vanishing gradient problem?  Unstable gradients in deep neural nets</a></h3></p><p>To get insight into why the vanishing gradient problem occurs, let'sconsider the simplest deep neural network: one with just a singleneuron in each layer.  Here's a network with three hidden layers:<center>  <img src="images/tikz37.png"/></center></p><p>Here, $w_1, w_2, \ldots$ are the weights, $b_1, b_2, \ldots$ are thebiases, and $C$ is some cost function.  Just to remind you how thisworks, the output $a_j$ from the $j$th neuron is $\sigma(z_j)$, where$\sigma$ is the usual <a href="chap1.html#sigmoid_neurons">sigmoid  activation function</a>, and $z_j = w_{j} a_{j-1}+b_j$ is the weightedinput to the neuron.  I've drawn the cost $C$ at the end to emphasizethat the cost is a function of the network's output, $a_4$: if theactual output from the network is close to the desired output, thenthe cost will be low, while if it's far away, the cost will be high.</p><p>We're going to study the gradient $\partial C / \partial b_1$associated to the first hidden neuron.  We'll figure out an expressionfor $\partial C / \partial b_1$, and by studying that expression we'llunderstand why the vanishing gradient problem occurs.</p><p>I'll start by simply showing you the expression for $\partial C /\partial b_1$.  It looks forbidding, but it's actually got a simplestructure, which I'll describe in a moment.  Here's the expression(ignore the network, for now, and note that $\sigma'$ is just thederivative of the $\sigma$ function):</p><p><center>  <img src="images/tikz38.png"/></center></p><p>The structure in the expression is as follows: there is a$\sigma'(z_j)$ term in the product for each neuron in the network; aweight $w_j$ term for each weight in the network; and a final$\partial C / \partial a_4$ term, corresponding to the cost functionat the end.  Notice that I've placed each term in the expression abovethe corresponding part of the network.  So the network itself is amnemonic for the expression.</p><p>You're welcome to take this expression for granted, and skip to the<a href="chap5.html#discussion_why">discussion of how it relates to the vanishing  gradient problem</a>.  There's no harm in doing this, since theexpression is a special case of our<a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">earlier  discussion of backpropagation</a>.  But there's also a simpleexplanation of why the expression is true, and so it's fun (andperhaps enlightening) to take a look at that explanation.</p><p>Imagine we make a small change $\Delta b_1$ in the bias $b_1$.  Thatwill set off a cascading series of changes in the rest of the network.First, it causes a change $\Delta a_1$ in the output from the firsthidden neuron.  That, in turn, will cause a change $\Delta z_2$ in theweighted input to the second hidden neuron.  Then a change $\Deltaa_2$ in the output from the second hidden neuron.  And so on, all theway through to a change $\Delta C$ in the cost at the output.  We have<a class="displaced_anchor" name="eqtn114"></a>\begin{eqnarray}  \frac{\partial C}{\partial b_1} \approx \frac{\Delta C}{\Delta b_1}.\tag{114}\end{eqnarray}This suggests that we can figure out an expression for the gradient$\partial C / \partial b_1$ by carefully tracking the effect of eachstep in this cascade.</p><p>To do this, let's think about how $\Delta b_1$ causes the output $a_1$from the first hidden neuron to change.  We have $a_1 = \sigma(z_1) =\sigma(w_1 a_0 + b_1)$, so<a class="displaced_anchor" name="eqtn115"></a><a class="displaced_anchor" name="eqtn116"></a>\begin{eqnarray}  \Delta a_1 & \approx &   \frac{\partial \sigma(w_1 a_0+b_1)}{\partial b_1} \Delta b_1 \tag{115}\\  & = & \sigma'(z_1) \Delta b_1.\tag{116}\end{eqnarray}That $\sigma'(z_1)$ term should look familiar: it's the first term inour claimed expression for the gradient $\partial C / \partial b_1$.Intuitively, this term converts a change $\Delta b_1$ in the bias intoa change $\Delta a_1$ in the output activation.  That change $\Deltaa_1$ in turn causes a change in the weighted input $z_2 = w_2 a_1 +b_2$ to the second hidden neuron:<a class="displaced_anchor" name="eqtn117"></a><a class="displaced_anchor" name="eqtn118"></a>\begin{eqnarray}  \Delta z_2 & \approx &  \frac{\partial z_2}{\partial a_1} \Delta a_1 \tag{117}\\  & = & w_2 \Delta a_1.\tag{118}\end{eqnarray}Combining our expressions for $\Delta z_2$ and $\Delta a_1$, we seehow the change in the bias $b_1$ propagates along the network toaffect $z_2$:<a class="displaced_anchor" name="eqtn119"></a>\begin{eqnarray}\Delta z_2 & \approx & \sigma'(z_1) w_2 \Delta b_1.\tag{119}\end{eqnarray}Again, that should look familiar: we've now got the first two terms inour claimed expression for the gradient $\partial C / \partial b_1$.</p><p>We can keep going in this fashion, tracking the way changes propagatethrough the rest of the network.  At each neuron we pick up a$\sigma'(z_j)$ term, and through each weight we pick up a $w_j$ term.The end result is an expression relating the final change $\Delta C$in cost to the initial change $\Delta b_1$ in the bias:<a class="displaced_anchor" name="eqtn120"></a>\begin{eqnarray}\Delta C & \approx & \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4) \frac{\partial C}{\partial a_4} \Delta b_1.\tag{120}\end{eqnarray}Dividing by $\Delta b_1$ we do indeed get the desired expression forthe gradient:<a class="displaced_anchor" name="eqtn121"></a>\begin{eqnarray}\frac{\partial C}{\partial b_1} = \sigma'(z_1) w_2 \sigma'(z_2) \ldots\sigma'(z_4) \frac{\partial C}{\partial a_4}.\tag{121}\end{eqnarray}</p><p><a id="discussion_why"></a></p><p><strong>Why the vanishing gradient problem occurs:</strong> To understand whythe vanishing gradient problem occurs, let's explicitly write out theentire expression for the gradient:<a class="displaced_anchor" name="eqtn122"></a>\begin{eqnarray}\frac{\partial C}{\partial b_1} = \sigma'(z_1) \, w_2 \sigma'(z_2) \, w_3 \sigma'(z_3) \, w_4 \sigma'(z_4) \, \frac{\partial C}{\partial a_4}.\tag{122}\end{eqnarray}Excepting the very last term, this expression is a product of terms ofthe form $w_j \sigma'(z_j)$.  To understand how each of those termsbehave, let's look at a plot of the function $\sigma'$: <div  id="sigmoid_prime_graph"><a  name="sigmoid_prime_graph"></a></div><script type="text/javascript"  src="js/d3.v3.min.js"></script></p><p>The derivative reaches a maximum at $\sigma'(0) = 1/4$.  Now, if weuse our <a href="chap3.html#weight_initialization">standard approach</a>to initializing the weights in the network, then we'll choose theweights using a Gaussian with mean $0$ and standard deviation $1$.  Sothe weights will usually satisfy $|w_j| < 1$.  Putting theseobservations together, we see that the terms $w_j \sigma'(z_j)$ willusually satisfy $|w_j \sigma'(z_j)| < 1/4$.  And when we take aproduct of many such terms, the product will tend to exponentiallydecrease: the more terms, the smaller the product will be.  This isstarting to smell like a possible explanation for the vanishinggradient problem.</p><p>To make this all a bit more explicit, let's compare the expression for$\partial C / \partial b_1$ to an expression for the gradient withrespect to a later bias, say $\partial C / \partial b_3$.  Of course,we haven't explicitly worked out an expression for $\partial C/ \partial b_3$, but it follows the same pattern described above for$\partial C / \partial b_1$.  Here's the comparison of the twoexpressions:<center>  <img src="images/tikz39.png"/></center>The two expressions share many terms.  But the gradient $\partial C/ \partial b_1$ includes two extra terms each of the form $w_j\sigma'(z_j)$. As we've seen, such terms are typically less than $1/4$in magnitude.  And so the gradient $\partial C / \partial b_1$ willusually be a factor of $16$ (or more) smaller than $\partial C/ \partial b_3$.  This is the essential origin of the vanishinggradient problem.</p><p>Of course, this is an informal argument, not a rigorous proof that thevanishing gradient problem will occur.  There are several possibleescape clauses.  In particular, we might wonder whether the weights$w_j$ could grow during training.  If they do, it's possible the terms$w_j \sigma'(z_j)$ in the product will no longer satisfy $|w_j\sigma'(z_j)| < 1/4$.  Indeed, if the terms get large enough -greater than $1$ - then we will no longer have a vanishing gradientproblem.  Instead, the gradient will actually grow exponentially as wemove backward through the layers.  Instead of a vanishing gradientproblem, we'll have an exploding gradient problem.</p><p><strong>The exploding gradient problem:</strong> Let's look at an explicitexample where exploding gradients occur.  The example is somewhatcontrived: I'm going to fix parameters in the network in just theright way to ensure we get an exploding gradient.  But even though theexample is contrived, it has the virtue of firmly establishing thatexploding gradients aren't merely a hypothetical possibility, theyreally can happen.</p><p>There are two steps to getting an exploding gradient.  First, wechoose all the weights in the network to be large, say $w_1 = w_2 =w_3 = w_4 = 100$.  Second, we'll choose the biases so that the$\sigma'(z_j)$ terms are not too small.  That's actually pretty easyto do: all we need do is choose the biases to ensure that the weightedinput to each neuron is $z_j = 0$ (and so $\sigma'(z_j) = 1/4$).  So,for instance, we want $z_1 = w_1 a_0 + b_1 = 0$.  We can achieve thisby setting $b_1 = -100 * a_0$.  We can use the same idea to select theother biases.  When we do this, we see that all the terms $w_j\sigma'(z_j)$ are equal to $100 * \frac{1}{4} = 25$.  With thesechoices we get an exploding gradient.</p><p><strong>The unstable gradient problem:</strong> The fundamental problem hereisn't so much the vanishing gradient problem or the exploding gradientproblem.  It's that the gradient in early layers is the product ofterms from all the later layers.  When there are many layers, that'san intrinsically unstable situation.  The only way all layers canlearn at close to the same speed is if all those products of termscome close to balancing out.  Without some mechanism or underlyingreason for that balancing to occur, it's highly unlikely to happensimply by chance.  In short, the real problem here is that neuralnetworks suffer from an <em>unstable gradient problem</em>.  As aresult, if we use standard gradient-based learning techniques,different layers in the network will tend to learn at wildly differentspeeds.</p><p><h4><a name="exercise_255808"></a><a href="chap5.html#exercise_255808">Exercise</a></h4><ul><li> In our discussion of the vanishing gradient problem, we made use  of the fact that $|\sigma'(z)| < 1/4$.  Suppose we used a different  activation function, one whose derivative could be much larger.  Would that help us avoid the unstable gradient problem?</ul></p><p><strong>The prevalence of the vanishing gradient problem:</strong> We've seenthat the gradient can either vanish or explode in the early layers ofa deep network.  In fact, when using sigmoid neurons the gradient willusually vanish.  To see why, consider again the expression $|w\sigma'(z)|$.  To avoid the vanishing gradient problem we need $|w\sigma'(z)| \geq 1$.  You might think this could happen easily if $w$is very large.  However, it's more difficult than it looks.  Thereason is that the $\sigma'(z)$ term also depends on $w$: $\sigma'(z)= \sigma'(wa +b)$, where $a$ is the input activation.  So when we make$w$ large, we need to be careful that we're not simultaneously making$\sigma'(wa+b)$ small.  That turns out to be a considerableconstraint.  The reason is that when we make $w$ large we tend to make$wa+b$ very large.  Looking at the graph of $\sigma'$ you can see thatthis puts us off in the "wings" of the $\sigma'$ function, where ittakes very small values.  The only way to avoid this is if the inputactivation falls within a fairly narrow range of values (thisqualitative explanation is made quantitative in the first problembelow).  Sometimes that will chance to happen.  More often, though, itdoes not happen.  And so in the generic case we have vanishinggradients.</p><p><h4><a name="problems_778071"></a><a href="chap5.html#problems_778071">Problems</a></h4><ul><li> Consider the product $|w \sigma'(wa+b)|$.  Suppose $|w  \sigma'(wa+b)| \geq 1$.  (1) Argue that this can only ever occur if  $|w| \geq 4$.  (2) Supposing that $|w| \geq 4$, consider the set of  input activations $a$ for which $|w \sigma'(wa+b)| \geq 1$.  Show  that the set of $a$ satisfying that constraint can range over an  interval no greater in width than  <a class="displaced_anchor" name="eqtn123"></a>\begin{eqnarray}    \frac{2}{|w|} \ln\left( \frac{|w|(1+\sqrt{1-4/|w|})}{2}-1\right).  \tag{123}\end{eqnarray}  (3) Show numerically that the above expression bounding the width of  the range is greatest at $|w| \approx 6.9$, where it takes a value  $\approx 0.45$.  And so even given that everything lines up just  perfectly, we still have a fairly narrow range of input activations  which can avoid the vanishing gradient problem.</p><p><li><strong>Identity neuron:</strong> <a id="identity_neuron"></a> Consider a  neuron with a single input, $x$, a corresponding weight, $w_1$, a  bias $b$, and a weight $w_2$ on the output.  Show that by choosing  the weights and bias appropriately, we can ensure $w_2 \sigma(w_1  x+b) \approx x$ for $x \in [0, 1]$.  Such a neuron can thus be used  as a kind of identity neuron, that is, a neuron whose output is the  same (up to rescaling by a weight factor) as its input.  <em>Hint:    It helps to rewrite $x = 1/2+\Delta$, to assume $w_1$ is small,    and to use a Taylor series expansion in $w_1 \Delta$.</em></ul></p><p><h3><a name="unstable_gradients_in_more_complex_networks"></a><a href="chap5.html#unstable_gradients_in_more_complex_networks">Unstable gradients in more complex networks</a></h3></p><p>We've been studying toy networks, with just one neuron in each hiddenlayer.  What about more complex deep networks, with many neurons ineach hidden layer?</p><p><center><img src="images/tikz40.png"/></center></p><p>In fact, much the same behaviour occurs in such networks.  In theearlier chapter on backpropagation we saw that the gradient in the$l$th layer of an $L$ layer network<a href="chap2.html#alternative_backprop">is given by</a>:</p><p><a class="displaced_anchor" name="eqtn124"></a>\begin{eqnarray}  \delta^l = \Sigma'(z^l) (w^{l+1})^T \Sigma'(z^{l+1}) (w^{l+2})^T \ldots  \Sigma'(z^L) \nabla_a C\tag{124}\end{eqnarray}</p><p>Here, $\Sigma'(z^l)$ is a diagonal matrix whose entries are the$\sigma'(z)$ values for the weighted inputs to the $l$th layer.  The$w^l$ are the weight matrices for the different layers.  And $\nabla_aC$ is the vector of partial derivatives of $C$ with respect to theoutput activations.</p><p>This is a much more complicated expression than in the single-neuroncase.  Still, if you look closely, the essential form is very similar,with lots of pairs of the form $(w^j)^T \Sigma'(z^j)$.  What's more,the matrices $\Sigma'(z^j)$ have small entries on the diagonal, nonelarger than $\frac{1}{4}$. Provided the weight matrices $w^j$ aren'ttoo large, each additional term $(w^j)^T \Sigma'(z^l)$ tends to makethe gradient vector smaller, leading to a vanishing gradient.  Moregenerally, the large number of terms in the product tends to lead toan unstable gradient, just as in our earlier example.  In practice,empirically it is typically found in sigmoid networks that gradientsvanish exponentially quickly in earlier layers.  As a result, learningslows down in those layers.  This slowdown isn't merely an accident oran inconvenience: it's a fundamental consequence of the approach we'retaking to learning.</p><p><h3><a name="other_obstacles_to_deep_learning"></a><a href="chap5.html#other_obstacles_to_deep_learning">Other obstacles to deep learning</a></h3></p><p>In this chapter we've focused on vanishing gradients - and, moregenerally, unstable gradients - as an obstacle to deep learning.  Infact, unstable gradients are just one obstacle to deep learning,albeit an important fundamental obstacle.  Much ongoing research aimsto better understand the challenges that can occur when training deepnetworks.  I won't comprehensively summarize that work here, but justwant to briefly mention a couple of papers, to give you the flavor ofsome of the questions people are asking.</p><p>As a first example, in 2010 Glorot andBengio*<span class="marginnote">*<a href="http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf">Understanding    the difficulty of training deep feedforward neural networks</a>, by  Xavier Glorot and Yoshua Bengio (2010). See also the earlier  discussion of the use of sigmoids in  <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient    BackProp</a>, by Yann LeCun, Léon Bottou,  Genevieve Orr and Klaus-Robert Müller (1998).</span>found evidence suggesting that the use of sigmoid activation functionscan cause problems training deep networks.  In particular, they foundevidence that the use of sigmoids will cause the activations in thefinal hidden layer to saturate near $0$ early in training,substantially slowing down learning.  They suggested some alternativeactivation functions, which appear not to suffer as much from thissaturation problem.</p><p>As a second example, in 2013 Sutskever, Martens, Dahl andHinton*<span class="marginnote">*<a href="http://www.cs.toronto.edu/&#126;hinton/absps/momentum.pdf">On    the importance of initialization and momentum in deep learning</a>,  by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton  (2013).</span> studied the impact on deep learning of both the randomweight initialization and the momentum schedule in momentum-basedstochastic gradient descent.  In both cases, making good choices madea substantial difference in the ability to train deep networks.</p><p>These examples suggest that "What makes deep networks hard totrain?" is a complex question.  In this chapter, we've focused on theinstabilities associated to gradient-based learning in deep networks.The results in the last two paragraphs suggest that there is also arole played by the choice of activation function, the way weights areinitialized, and even details of how learning by gradient descent isimplemented.  And, of course, choice of network architecture and otherhyper-parameters is also important. Thus, many factors can play a rolein making deep networks hard to train, and understanding all thosefactors is still a subject of ongoing research.  This all seems ratherdownbeat and pessimism-inducing.  But the good news is that in thenext chapter we'll turn that around, and develop several approaches todeep learning that to some extent manage to overcome or route aroundall these challenges.</p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p><script src="js/misc.js" type="text/javascript"></script><script src="js/canvas.js" type="text/javascript"></script><script src="js/neuron.js" type="text/javascript"></script><script src="js/chap5.js" type="text/javascript"></script></p><p></div><div class="footer"> <span class="left_footer"> In academic work,
please cite this book as: Michael A. Nielsen, "Neural Networks and
Deep Learning", Determination Press, 2015

<br/>
<br/>

This work is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"
style="color: #eee;">Creative Commons Attribution-NonCommercial 3.0
Unported License</a>.  This means you're free to copy, share, and
build on this book, but not to sell it.  If you're interested in
commercial use, please <a
href="mailto:mn@michaelnielsen.org">contact me</a>.
</span>
<span class="right_footer">
Last update: Thu Dec 26 15:26:33 2019
<br/>
<br/>
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"><img alt="Creative Commons Licence" style="border-width:0" src="http://i.creativecommons.org/l/by-nc/3.0/88x31.png" /></a>
</span>
</div>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-44208967-1', 'neuralnetworksanddeeplearning.com');
  ga('send', 'pageview');

</script>
</body>
</html>